Introduction

This exploratory data analysis draws on the Behavioral Risk Factor Surveillance System (BRFSS) dataset from the Centers for Disease Control and Prevention (CDC). According to the CDC, the BRFSS was established in 1984 and now covers all 50 states, the District of Columbia, and three U.S. territories, making it the nation’s premier system of health-related telephone surveys. With more than 400,000 adult interviews each year, it is the largest continuously conducted health survey system in the world, and it tracks trends in health-related risk behaviors among US residents.

The BRFSS dataset used here contains 8,887 rows of survey responses from 2017, 2018, 2021, and 2022. Its 36 columns are named with abbreviations of the questions asked of survey takers. They cover general health, check-ups, blood pressure medication, asthma, diabetes, marital status, education, employment, income, difficulty walking, smoking, age, race, state of residence, and US region, among others. The dataset also records the number of poor physical and mental health days, body mass index (BMI), and survey year. This mix of categorical indicators and numerical measures gives a broad picture of health across demographic groups and geographic regions.

This analysis focuses only on the columns for year, age range, BMI, smoker status, general health status, US region, and the number of poor mental and physical health days. Preparing the dataset involved simplifying and renaming columns for readability, removing columns not relevant to this exploration, and converting the “year” column to numeric so that it is easier to read and summarize. Rows with missing data (NAs) were removed before each analysis question was explored rather than once at the start, so that a row missing a value in one variable is not lost from the analysis of another variable. New data frames and columns were also created for this EDA. This report includes a summary of the dataset, the creation and calculation of the new data frames and columns, and analysis of the following: disparities in smoker status between age groups, the relationship between BMI and smoker status, regional disparities in poor mental health days, and disparities in poor health days by general health status.

By carefully exploring and analyzing the dataset, insightful information can be discovered that can improve public health programs and the overall well-being of communities across the nation.
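The preparation steps described in this introduction can be sketched in a few lines. The example below is a minimal illustration in Python, not the code used for this report. The raw column names (IYEAR, BMI_RAW, SMOKER, MENTHLTH, IDATE), the simplified names, and the three sample rows are all hypothetical. It shows renaming and dropping columns, converting the year to a number, and removing rows with missing values separately for each analysis question.

```python
# Hypothetical raw rows; the column names and values are illustrative
# assumptions, not the actual BRFSS export used in this report.
raw_rows = [
    {"IYEAR": "2017", "BMI_RAW": 2755, "SMOKER": "Former smoker", "MENTHLTH": 3, "IDATE": "a"},
    {"IYEAR": "2021", "BMI_RAW": None, "SMOKER": "Current smoker", "MENTHLTH": 0, "IDATE": "b"},
    {"IYEAR": "2022", "BMI_RAW": 3012, "SMOKER": "Current smoker", "MENTHLTH": None, "IDATE": "c"},
]

# Original name -> simplified name. Columns not listed here are dropped.
RENAME = {
    "IYEAR": "year",
    "BMI_RAW": "bmi",
    "SMOKER": "smoker_status",
    "MENTHLTH": "poor_mental_days",
}


def prepare(rows):
    """Rename the kept columns, drop the rest, and make the year numeric."""
    prepared = []
    for row in rows:
        clean = {new: row.get(old) for old, new in RENAME.items()}
        clean["year"] = int(clean["year"])
        prepared.append(clean)
    return prepared


def complete_cases(rows, columns):
    """Keep only rows that have a value in every column needed for one question."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


data = prepare(raw_rows)

# Missing values are removed per question, so each question keeps every row
# that is complete for the variables it actually uses.
bmi_rows = complete_cases(data, ["bmi", "smoker_status"])
mental_rows = complete_cases(data, ["poor_mental_days"])

print(len(data), len(bmi_rows), len(mental_rows))
```

In this toy example, each question keeps two of the three rows, but not the same two. Removing every incomplete row at the start would have left only one row for both questions.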

Data Analysis

Disparities of Smoker Status Between Age Groups (2017-2018 & 2021-2022)

Figures 1, 2, 3, and 4 show the number of current and former smokers in each age group for 2017, 2018, 2021, and 2022, respectively, based on the BRFSS dataset. Current smokers are represented by pink bars and former smokers by blue bars. Each bar corresponds to one age group, from 18-24 to 65+, and its height indicates the count of current or former smokers in that group.

Figure 1 shows the relationship between age group and the number of current smokers in 2017, which increases with age. Notably, 1,415 individuals aged 65+ identified as current smokers, while 181 individuals aged 18-24 fell into the same category. The count of current smokers rises as age groups progress: 18-24 (181 individuals), 25-34 (353 individuals), 35-44 (402 individuals), 45-54 (499 individuals), and 55-64 (736 individuals). Former smokers are far less numerous in 2017, totaling 595 individuals compared with 3,586 current smokers. The count of former smokers also rises with age across 18-24 (32 individuals), 25-34 (85 individuals), 45-54 (94 individuals), and 55-64 (175 individuals). It then falls by 57 in the 65+ age group (118 individuals), which makes 55-64 the age group with the highest count of former smokers.
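The 2017 counts quoted above can be checked with a short tally. The sketch below is illustrative Python rather than the code behind Figure 1. It uses only the current-smoker counts reported in this section and works with raw respondent counts, not population prevalence rates.

```python
# Current-smoker counts for 2017 as reported in the text for Figure 1.
current_2017 = {
    "18-24": 181,
    "25-34": 353,
    "35-44": 402,
    "45-54": 499,
    "55-64": 736,
    "65+": 1415,
}

# Total across age groups; the text reports 3,586 current smokers in 2017.
total_current = sum(current_2017.values())

# Age group with the highest count.
peak_group = max(current_2017, key=current_2017.get)

# Share of all 2017 current smokers in each age group, in percent.
shares = {group: round(100 * n / total_current, 1) for group, n in current_2017.items()}

# True when every age group has a higher count than the one before it.
counts = list(current_2017.values())
strictly_increasing = all(a < b for a, b in zip(counts, counts[1:]))

print(total_current, peak_group, strictly_increasing)
print(shares)
```

The per-group counts add up to the reported total of 3,586, and the 65+ group accounts for roughly two in five of the current smokers counted in 2017.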

Figure 2 shows a similar pattern for 2018: the count of current smokers again increases with age. In this dataset, 35 individuals aged 65+ identified as current smokers in 2018, the highest count of any age group.

The count of current smokers also rises across the younger age groups: 18-24 (4 individuals), 25-34 (10 individuals), 35-44 (14 individuals), 45-54 (17 individuals), and 55-64 (25 individuals). Former smokers are again less numerous in 2018, totaling 25 individuals compared with 107 current smokers. The count of former smokers increases from the 35-44 age group (5 individuals) to the 45-54 age group (9 individuals), drops in the 55-64 age group (2 individuals), and rises again in the 65+ age group (4 individuals). This makes 45-54 the age group with the highest count of former smokers.

Figure 3 shows the same pattern as the previous figures. In 2021, current smokers (3,279 individuals) are more numerous than former smokers (505 individuals), and the count of current smokers again increases with age. In this dataset, 1,293 individuals aged 65+ identified as current smokers in 2021, the highest count of any age group.

The count of current smokers rises as age groups progress: 18-24 (207 individuals), 25-34 (365 individuals), 35-44 (406 individuals), 45-54 (481 individuals), and 55-64 (627 individuals). The count of former smokers also increases across the youngest age groups: 18-24 (11 individuals), 25-34 (66 individuals), and 35-44 (93 individuals). It then dips slightly in the 45-54 age group (90 individuals), rises in the 55-64 age group (128 individuals), and falls again in the 65+ age group (117 individuals). This makes 55-64 the age group with the highest count of former smokers.

In 2022 (Figure 4), the pattern changes slightly, but current smokers (177 individuals) remain more numerous than former smokers (37 individuals). The count of current smokers increases across age groups 18-24 (15 individuals), 25-34 (22 individuals), and 35-44 (28 individuals), holds at 28 individuals in the 45-54 age group, and then falls in the 55-64 age group (22 individuals). The 65+ age group again has the highest count, with 64 individuals identifying as current smokers in 2022. The count of former smokers increases slightly across the youngest age groups: 18-24 (3 individuals), 25-34 (5 individuals), and 35-44 (6 individuals). It then dips in the 45-54 age group (4 individuals), rises in the 55-64 age group (11 individuals), and falls again in the 65+ age group (8 individuals). This makes 55-64 the age group with the highest count of former smokers.

BMI & Smoker Status Correlation

Figures 5 and 6 present the frequency distributions of body mass index (BMI) for former and current smokers, respectively. Neither histogram follows a normal distribution; both are right-skewed. That is, the peak sits toward the left side of the x-axis, and a long tail of less frequent values extends to the right.

Figure 5 shows a distinctly right-skewed distribution of BMI among former smokers. In the dataset (Table 1), the average BMI for former smokers is 2,755.75 with a standard deviation of 580.11. These values appear to follow the BRFSS convention of storing BMI with two implied decimal places, so 2,755.75 corresponds to a BMI of about 27.56. Frequencies fall off once BMI surpasses 3,600 and become especially sparse beyond 4,500.

Similarly, Figure 6 shows a right-skewed distribution for current smokers. Table 1 gives an average BMI of 2,830.16 with a standard deviation of 634.76. In this histogram, frequencies begin to fall beyond a BMI of 4,000 and become especially sparse after 5,300.

Table 1

Summary Stats of BMI in Relation to Smoker Status

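| Smoker_Status  | Mean_BMI | SD_BMI | Min_BMI  | Max_BMI  | Total_Count |
|----------------|----------|--------|----------|----------|-------------|
| Current smoker | 2,830.16 | 634.76 | 1,288.00 | 7,113.00 | 3,369       |
| Former smoker  | 2,755.75 | 580.11 | 1,673.00 | 5,234.00 | 550         |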
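The right skew described for Figures 5 and 6 can be cross-checked from summary statistics alone, because a mean that sits above the median is a typical sign of a long right tail. The sketch below is illustrative Python, not the original analysis code. It uses Pearson's second skewness coefficient, 3 × (mean − median) / SD, as a rough indicator. The means and standard deviations come from Table 1, and the medians are the 50th-percentile values reported with the Figure 7 box plot. Dividing by 100 to express BMI in conventional units is an assumption based on the BRFSS practice of storing BMI with two implied decimal places.

```python
def pearson_median_skewness(mean, median, sd):
    """Pearson's second skewness coefficient: positive values suggest right skew."""
    return 3 * (mean - median) / sd


# Means and SDs from Table 1; medians from the Figure 7 box plot discussion.
groups = {
    "Current smoker": {"mean": 2830.16, "median": 2732, "sd": 634.76},
    "Former smoker": {"mean": 2755.75, "median": 2663, "sd": 580.11},
}

skewness = {}
for name, stats in groups.items():
    skewness[name] = pearson_median_skewness(stats["mean"], stats["median"], stats["sd"])
    # Assumption: stored values carry two implied decimals, so divide by 100.
    mean_bmi = stats["mean"] / 100
    print(f"{name}: mean BMI {mean_bmi:.2f}, skewness coefficient {skewness[name]:.2f}")
```

Both coefficients come out positive (a little under 0.5), which is consistent with the moderate right skew visible in the histograms. This coefficient is only a quick approximation, and a moment-based skewness computed from the raw rows could differ.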

These observations underline the right skew of the BMI distribution among both current and former smokers: most values cluster at the lower end, with occasional much higher values. Those rare high values appear as outliers in the box plot in Figure 7. Among current smokers there is one low outlier with a BMI of 1,304, along with high outliers above 4,207. Among former smokers, outliers appear only above a BMI of 4,184, and there are fewer of them than among current smokers. For current smokers, the first quartile (25th percentile) of BMI is 2,403, the median (50th percentile) is 2,732, and the third quartile (75th percentile) is 3,125. For former smokers, the first quartile is 2,352, the median is 2,663, and the third quartile is 3,101.
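The outlier thresholds in the box plot can be related to the quartiles quoted above. The sketch below is illustrative Python and assumes the conventional 1.5 × IQR rule (Tukey fences) for flagging outliers, which the report does not state explicitly. The quartile values are the ones reported for Figure 7.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences; points outside them are flagged as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported in the text for Figure 7 (BMI on the dataset's stored scale).
current_low, current_high = tukey_fences(2403, 3125)
former_low, former_high = tukey_fences(2352, 3101)

print("Current smokers:", current_low, current_high)
print("Former smokers:", former_low, former_high)

# The low value of 1,304 among current smokers falls below the lower fence.
print(1304 < current_low)

# The minimum BMI for former smokers in Table 1 (1,673) stays above their lower fence.
print(1673 > former_low)
```

Under this assumption, the fences for current smokers are 1,320 and 4,208, which agrees with the single low outlier at 1,304 and with high outliers starting just past 4,207. For former smokers, the fences are 1,228.5 and 4,224.5, so no low outliers are expected.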

US Regional Disparities in Poor Mental Health Days

Figure 8 shows comparable averages of poor mental health days across US regions. According to Table 2, the average number of poor mental health days per month is about 4 in each region, ranging from 3.5 in the Midwest to 4.0 in the South. The standard deviations are also similar: Midwest (7.6 days), Northeast (7.7 days), South (8.4 days), and West (7.7 days). In the box plot (Figure 8), outliers appear in every region beyond 8 days of poor mental health, and the frequency of responses in each region declines after 7 days of poor mental health.

Table 2

Summary Stats of Poor Mental Health Days in Relation to US Region

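| Region    | Mean_BMHD | SD_BMHD | Min_BMHD | Max_BMHD | Total_BMHD |
|-----------|-----------|---------|----------|----------|------------|
| Midwest   | 3.50      | 7.60    | 0.00     | 30.00    | 2,559      |
| Northeast | 3.60      | 7.70    | 0.00     | 30.00    | 1,715      |
| South     | 4.00      | 8.40    | 0.00     | 30.00    | 2,386      |
| West      | 3.70      | 7.70    | 0.00     | 30.00    | 1,916      |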
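A grouped summary like Table 2 can be produced by collecting the values for each region and computing the statistics per group. The sketch below is a minimal illustration in Python using only the standard library; it is not the code used for this report. The sample rows are hypothetical and deliberately tiny, so the results do not match Table 2.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (region, poor mental health days) pairs; None marks a missing answer.
rows = [
    ("Midwest", 0), ("Midwest", 2), ("Midwest", 10),
    ("South", 0), ("South", 30), ("South", 0), ("South", None),
    ("West", 5), ("West", 1),
]


def summarize(rows):
    """Group values by region, skip missing answers, and compute summary statistics."""
    by_region = defaultdict(list)
    for region, days in rows:
        if days is not None:
            by_region[region].append(days)
    return {
        region: {
            "mean": round(mean(values), 2),
            "sd": round(stdev(values), 2),  # sample SD; needs at least two values
            "min": min(values),
            "max": max(values),
            "n": len(values),
        }
        for region, values in by_region.items()
    }


summary = summarize(rows)
for region, stats in summary.items():
    print(region, stats)
```

Even in this toy data, the South group has a standard deviation larger than its mean. Table 2 shows the same pattern in every region, and it is typical of a variable bounded at zero where most respondents report few or no poor days and a small number report many.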

Health Status Disparities in Poor Health Days

Figure 9 is a box plot showing how the number of poor health days differs across general health statuses. Table 3 gives the average (avg.) and standard deviation (SD) of poor health days (PHD) for each status: excellent health (avg. 2 PHD, SD 5 PHD), very good health (avg. 2 PHD, SD 5 PHD), good health (avg. 4 PHD, SD 7 PHD), fair health (avg. 9 PHD, SD 11 PHD), and poor health (avg. 18 PHD, SD 13 PHD). In other words, the average number of reported poor health days rises as self-rated health worsens, from 2 days for excellent and very good health to 18 days for poor health.

Table 3

Summary Stats of Poor Health Days in Relation to General Health Status

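| General_Health_Status | Mean_PHD | SD_PHD | Min_PHD | Max_PHD | Total_Count |
|-----------------------|----------|--------|---------|---------|-------------|
| Excellent             | 2.00     | 5.00   | 0.00    | 30.00   | 436         |
| Very good             | 2.00     | 5.00   | 0.00    | 30.00   | 1,360       |
| Good                  | 4.00     | 7.00   | 0.00    | 30.00   | 1,520       |
| Fair                  | 9.00     | 11.00  | 0.00    | 30.00   | 909         |
| Poor                  | 18.00    | 13.00  | 0.00    | 30.00   | 339         |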
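The group means in Table 3 can be combined into a single overall figure by weighting each mean by its group size. The sketch below is illustrative Python and assumes that Total_Count is the number of respondents in each health status group, which the table does not state explicitly. Because the table reports rounded means, the pooled value is approximate.

```python
# (status, mean poor health days, respondent count) taken from Table 3.
table3 = [
    ("Excellent", 2.0, 436),
    ("Very good", 2.0, 1360),
    ("Good", 4.0, 1520),
    ("Fair", 9.0, 909),
    ("Poor", 18.0, 339),
]

# Total respondents across all health statuses (assumed meaning of Total_Count).
total_n = sum(n for _, _, n in table3)

# Count-weighted mean of poor health days across all groups.
pooled_mean = sum(m * n for _, m, n in table3) / total_n

# True when the mean never decreases as self-rated health worsens.
means = [m for _, m, _ in table3]
non_decreasing = all(a <= b for a, b in zip(means, means[1:]))

print(total_n, round(pooled_mean, 2), non_decreasing)
```

Under these assumptions, the pooled average is a little over 5 poor health days. That is much closer to the means for the excellent through good groups than to the poor health mean of 18, because those three groups contain most of the respondents.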

Furthermore, Figure 9 shows that the number of poor health days is right-skewed for the excellent, very good, good, and fair health statuses, whereas for the poor health status the distribution appears left-skewed. For excellent health, the frequency of responses starts to decline after 2 days of poor health; for very good health, after 5 days; for good health, after 10 days; and for fair health, after 15 days. In contrast, for poor health the frequency starts to increase after 2 days of poor health.

Summary

The findings of this exploratory data analysis offer insight into disparities in smoker status across age groups, the relationship between BMI and smoker status, regional disparities in poor mental health days, and health status disparities reflected in poor health days. These findings can help improve public health programs and the overall well-being of communities across the nation. The count of current smokers rises with age, whereas the count of former smokers varies across age groups. The BMI distributions for both current and former smokers are right-skewed, with most values concentrated at the lower end. Based on these findings, further research could examine BMI among current and former smokers by age group. Another analysis could examine the relationship between lung cancer and age group.

Additionally, poor mental health days show similar averages (about 4 days) across US regions, with outliers beyond 8 days of poor mental health. This suggests that region of residence may have little bearing on the number of poor mental health days reported. It would be interesting to investigate further which state’s residents report the fewest poor mental health days.

The comparison of poor health days by health status shows right-skewed frequency distributions for the excellent, very good, good, and fair statuses. Respondents who report fewer poor health days (avg. 2-9 PHD) within a 30-day period tend to rate their health better than those with a poor health status (avg. 18 PHD). This is consistent with the left skew of the poor health box plot: individuals who report more frequent poor health days tend to fall into the poor health status category.

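Based on these findings from the BRFSS dataset, the questions raised above for further exploration are how BMI varies by age group among current and former smokers, how lung cancer relates to age group, and which state’s residents report the fewest poor mental health days.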

By studying and analyzing the BRFSS dataset, valuable insights can be revealed that can help guide and improve health programs nationwide.

Resources

Bluman, A. (2018). Elementary statistics: A step by step approach (10th ed.). McGraw Hill.

Centers for Disease Control and Prevention. (2024, January 9). CDC - BRFSS. Centers for Disease Control and Prevention. https://www.cdc.gov/brfss/index.html

Kabacoff, R. I. (2022). R in action: Data analysis and graphics with R and tidyverse (3rd ed.). Manning.

Koloski, D. (2024, March 3). ALY6010 Sunday Office Hours. Teams.com. https://teams.microsoft.com/l/meetup-join/19%3ameeting_ZTNiNGQ3MzQtZDM0OS00MWM4LTg4ZjAtOTEyMjRkODJlOWQ5%40thread.v2/0?context=%7b%22Tid%22%3a%22a8eec281-aaa3-4dae-ac9b-9a398b9215e7%22%2c%22Oid%22%3a%22e1b4a20f-e33d-4390-b8aa-43497b27857d%22%7d

Koloski, D. (2024, March 5). ALY6010 Tuesday Office Hours. Teams.com. https://teams.microsoft.com/l/meetup-join/19%3ameeting_ZTNiNGQ3MzQtZDM0OS00MWM4LTg4ZjAtOTEyMjRkODJlOWQ5%40thread.v2/0?context=%7b%22Tid%22%3a%22a8eec281-aaa3-4dae-ac9b-9a398b9215e7%22%2c%22Oid%22%3a%22e1b4a20f-e33d-4390-b8aa-43497b27857d%22%7d